Summary. We present a concept to achieve non-repudiation for
natural language conversations over the Internet. The method rests on
chained electronic signatures applied to pieces of packet-based, digital, 
voice communication. It establishes the integrity and authenticity of the 
bidirectional data stream and its temporal sequence and thus the secu- 
rity context of a conversation. The concept is close to the protocols for 
Voice over the Internet (VoIP), provides a high level of inherent security, 
and extends naturally to multilateral non-repudiation, e.g., for confer- 
ences. Signatures over conversations can become true declarations of 
will in analogy to electronically signed, digital documents. This enables 
binding verbal contracts, in principle between unacquainted speakers, 
and in particular without witnesses. A reference implementation of a 
secure VoIP archive is exhibited. 

1 Introduction 

The latest successful example of the ongoing convergence of information
technologies is Internet-based telephony, transporting voice over the Internet
protocol (VoIP). Analysts estimate an annual growth rate in the range of 20% to 45%,
expecting that VoIP will carry more than fifty percent of business voice
traffic (UK) in a few years [1]. The success of VoIP will not be limited to cable
networks; convergent speech and data transmission will affect next generation
mobile networks as well. The new technology raises some security issues. For
eavesdropping traditional, switched analogue or digital phone calls, an attacker 
needs physical access to the transport medium. Digital networks are generally 
more amenable to attacks, as holds already for ISDN and to a yet greater extent 
for IP networks. Efforts to add security features to VoIP products are gener- 
ally insufficient, though proposals exist for the protection of confidentiality and 
privacy. Secure VoIP protocols, using cryptographic protection of a call, would 
even be at an advantage compared to traditional telephony systems. Protocols 
like SRTP [2] can provide end-to-end security to phone calls, independently of 
the security of transport medium and communication provider. 

With VoIP maturing, it becomes natural to ask for application-level security
in the context of IP telephony. Our purpose is to achieve non-repudiation in this 
context, i.e., for speech over packet-oriented, digital channels, and in particular 
for VoIP conversations. This means the capability to produce tenable evidence
that a conversation with the alleged contents has taken place between two or
more parties. Ancillary information, e.g., that the conversation partners have 
designated, personal identities, and the time at which the conversation has 
taken place, may be of utmost importance in this regard, either to establish 
a supporting plausibility, e.g., 'caller was not absent during the alleged call', 
or to express relevant semantic information, e.g., 'telephonic order came in 
before stock price rose'. For electronic documents this kind of non-repudiation 
is commonly achieved by applying electronic signatures based on asymmetric 
cryptography. In the communication between several parties, the desired result 
is a binding contract, and in analogy the central goal of the present contribution 
is a technology to establish binding verbal contracts without witnesses. 

This subject has a long pre-history: As early as 1905, Edison proposed the
recording of voice, which was patented in 1911 [3]. With the advent of digital
signature technology, Merkle [4] envisioned, referring to Diffie and Hellman, that
"Digital signatures promise to revolutionize business by phone". However, work
on non-repudiation of digital voice communication is scarce. The work most 
closely related to ours is the proposal in [5] , resting on the theory of contracts 
and multi-lateral security [6]. It comprises a trusted third party ('Tele-Witness')
that is invoked by communicating parties to securely record conversations and 
make them available as evidence at any later point in time. 

Non-repudiation of inter-personal communication is interesting because of 
its inherent evidentiary value, exposed by forensic evaluation of the contained 
biometric data, e.g., as an independent means of speaker identification [7, 8]. 
Methods for the latter are advanced [9], lending recorded voice a high probative
force, e.g., in a court of law. In comparison to other media, specific
features of voice contribute to non-repudiation. Voice communication is inter- 
active [10] and enables partners to make further enquiries in case of insufficient 
understanding. This mitigates to some extent problems to which signed digital 
documents are prone, e.g., misinterpretations due to misrepresentation, lack of 
uniqueness of presentation, and inadvertent or malicious hiding of content [11]. 

We set out requirements for non-repudiation, which are very particular in the
case of VoIP and other multi-media communication over IP, in Section 2 and
propose the method to meet them in Section 3. Section 4 analyses the security
of the method by listing and assessing the auditable information secured by it.
Section 5 describes the implementation of a secure VoIP archive. Conclusions
and an outlook are found in Section 6. A definition of and criterion for multilateral
non-repudiation, used in Section 3.4, are provided in the Appendix.

2 Requirements for non-repudiation of conversations 

From the schematic characterisation of non-repudiation in the standards [12,
13], we focus on the secure creation of evidence for later forensic inspection. 
This overlaps with the basic information security targets integrity and avail- 
ability of the well-known CIA triad. To account for the particularities of the 
channel, we here take a communication-theoretical approach to derive requirements
for non-repudiation. The general characteristics of the class of electronic
communication that we address are the same for a wide media range, comprising
audio, video, and multi-media. In essence it is always a full duplex or multiplex
channel operating in real time using data packets, and we subsume communication
over those under the term conversation. Generic requirements for the
non-repudiation of conversations can be profiled for specific media, and we
occasionally allude, by way of example, to the case of speech and VoIP. They are grouped
around the top level protection targets congruence and cohesion. We describe
the latter and devise for each a minimal set of specific, but application- and 
technology-neutral requirements. The requirements are necessary preconditions 
to achieve the protection targets, and are ordered by ascending complexity. 

T1 Congruence. Communication theory and linguistics have established
that the attributions of meanings can vary between a sender and a receiver of
a message [14, Chapter 6], [15], a basic problem for non-repudiation. Apart
from the ambiguity of language, this implies particular problems for electronic 
communication channels and media. For digital documents bearing electronic 
signatures, the presentation problem is addressed by invoking the 'What You 
See is What You Sign' (WYSIWYS [11]) principle. It is often tacitly assumed 
that presentation environments can be brought into agreement for sender and 
receiver of a signed document [16]. We term this fundamental target 'congruence'.
It has special traits in the case of telephony. Essential for non-repudiation
is the receiver's understanding, which leads in analogy to the principle 'What Is 
Heard Is What Is Signed'. But additionally it is indispensable to assure senders 
(speakers) about what precisely was received (heard). 

R1.1 Integrity of the data in transmission, including technical environments
for sending and receiving them. For VoIP, this is to be addressed at the
level of single RTP packets and their payloads and of an entire conversation. 

R1.2 Treatment of losses in the channel must allow senders to be informed
about what was actually received. This is independent of methods for
avoidance or compensation of losses, such as Packet Loss Concealment (PLC).
Rather it means a secure detection of losses (enabled by fulfilled R1.1), enabling
a proper handling on the application level as well as a later (forensic) inspection.

R1.3 User interaction policies and their enforcement finally use fulfilled
R1.1 and R1.2 to ensure congruence in the inter-personal conversation. For electronic
documents this can simply amount to prescriptions about the technical
environments in which an electronically signed document must be displayed. Or
it can be an involved scheme to guarantee the agreement of contents of documents
undergoing complex transformations [17, 18], e.g., between data formats.
For speech, it can be realised in various ways taking into account the interactive 
nature of the medium. This is elaborated on in Section 3.5. 

T2 Cohesion regards the temporal dimension of conversations. It means in
particular the protection and preservation of the sequence of the information flows
in all directions of the channel. Again this is at variance with signed documents,
where temporal sequence of communication is immaterial. Cohesion means to
establish a complete temporal context of a conversation, usually even in absolute
time, since the temporal reference frame of a conversation can be meaningful.

R2.1 Start times of conversations must be determined and recorded. This
is analogous to the signing time of documents (the assignment of which is a
requirement for qualified signatures according to the EU Signature Directive).

R2.2 Temporal sequencing of conversations must be protected and re- 
lated to the reference time frame established by fulfilling R2.1. 

R2.3 Continual authentication of communication devices and if possible 
even communication partners is necessary, e.g., to prevent hijacking. 

R2.4 Determined break points must allow for non-repudiation of con- 
versations until they are terminated intentionally or inadvertently. 

From the requirements analysis it is apparent that congruence and cohesion 
are complementary but not orthogonal categories. A specific profile for VoIP is 
not formulated here for brevity, but rather included in the development of the 
method below. It is understood that additionally the known standard require- 
ments for electronic signatures as declarations of will and for non-repudiation 
of electronically signed documents, which are rooted in the theory of multi- 
lateral security [19], must be taken into account. We do not address details of 
user authentication, consent to recording, general privacy, confidentiality, and 
interaction with respect to the signing as a declaration of will proper. Nonethe- 
less, the method proposed below enables the secure recording and archiving to 
preserve the probative value of a conversation, as demonstrated in Section 5. 

3 The method 

The requirements (R2.4) entail that signing an entire conversation with a single
RSA signature by A is not viable, since this gives A full discretion to determine
(maliciously) the end time of signing of a conversation, and deprives B of any
possibility to control and verify this during the conversation. The opposite approach
of securing single packets does not assure cohesion (R2.2 in conjunction with
R1.1), since single RTP packets contain only little audio data and may then
easily be reordered. Apart from that, it would be computationally expensive.
This is the prime motivation for the method we now present in general for the
case of a bilateral conversation between A and B, using, e.g., the SIP/RTP
protocol combination [20, 21]. In a basic model A secures the conversation
as a unilateral declaration of will. We proceed in a bottom-up fashion from
the base concept of intervals of VoIP data, over securing their integrity by a
cryptographic chain, to coping with inevitable packet loss. For later reference we
call the technique presented in Sections 3.1 to 3.4 below the interval-chaining method.

3.1 Building intervals 

Intervals are the logical units on which the protection method operates. Intervals
span certain amounts, which may be nil, of RTP packets for only one
direction. As bi-directional communication needs formation of intervals for both
directions, A and B hold buffers for packets both sent and received. Since directions
are handled differently w.r.t. packet loss, as described in Section 3.3,
directionally homogeneous intervals are advantageous from a protocol design
viewpoint. To resolve the full duplex audio stream into an interval sequence
we determine that intervals in the directions from and to A alternate. Intervals
are enumerated as $I_{2k-1}$, $I_{2k}$, $k = 1, \dots, N$, for directions $A \to B$ and $B \to A$,
respectively. Interval $I_l$ comprises RTP packets $(p_{l,j})$, $j = 1, \dots, K_l$, sent or
received by A. For the moment we assume that there is no packet loss.

The length of an interval (in appropriate units) is a main adjustable param- 
eter, and an important degree of freedom. Adjustable sizes of, e.g., data frames 
are not very common in communication technology, but recent proposals [23] 
show that they can be advantageous in certain situations, like the present one. 
We determine that interval boundaries are triggered by the elapse of a certain
time, called interval duration and denoted by $D$. If $T$ is the duration of the
conversation then $N = \lceil T/D \rceil$. Basing intervals on time necessitates the formation
of intervals without voice data payload when a silence period exceeds $D$. This
design choice entails some signalling, transport, and cryptographic overhead.
This is however outweighed by some favourable properties. In particular, the
maximum buffer length is known from the outset, and control of the interval
duration is a direct means to cope with the (known) slowness of (public key)
cryptographic soft- and hardware. Adjustment of $D$ therefore allows for a trade-off,
even a dynamic one, between security and performance, as it controls the
ratio of security data to payload data. The alternative of triggering intervals
when the packet buffers at both sides run full causes concurrency problems.
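
The time-based formation of intervals can be illustrated by a small Python sketch. It is not taken from the reference implementation: the function name `build_intervals`, the representation of packets as (timestamp, payload) pairs, and the example call are assumptions made for illustration only. The sketch handles one direction and shows that $N = \lceil T/D \rceil$ intervals result, and that a silence period longer than $D$ produces an interval without payload.

```python
import math

def build_intervals(packets, duration, interval_duration):
    """Group (timestamp, payload) pairs of ONE direction into time-based intervals.

    Returns N = ceil(T / D) lists of payloads; a list may be empty (silence).
    """
    n = math.ceil(duration / interval_duration)
    intervals = [[] for _ in range(n)]
    for ts, payload in packets:
        # Index of the interval in which a packet sent at time ts falls;
        # a packet sent exactly at the end of the call goes into the last interval.
        idx = min(int(ts // interval_duration), n - 1)
        intervals[idx].append(payload)
    return intervals

# Hypothetical call of T = 2.5 s with D = 1 s and one packet every 20 ms,
# interrupted by silence during the second interval.
packets = [(t / 1000, b"p%d" % t) for t in range(0, 1000, 20)]
packets += [(t / 1000, b"p%d" % t) for t in range(2000, 2500, 20)]
intervals = build_intervals(packets, duration=2.5, interval_duration=1.0)
sizes = [len(interval) for interval in intervals]
```

With these assumed values the call is resolved into three intervals, the middle one being empty. Since the buffer never holds more than the packets of one interval duration, its maximum length follows directly from $D$ and the packet rate.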

Since the communication channel is full duplex, the sequence of intervals
does not reflect the temporal sequence of audio data; rather, $I_{2k-1}$ and $I_{2k}$
comprise approximately concurrent data sent in both directions. But this is
immaterial since intervals are only logical units and security data for intervals
can be stored separately from the RTP streams. This is a key feature of our
method. It does not affect the VoIP communication at all but can be run in
complete separation from it, logically and even physically (extra hardware).
VoIP communication is therefore not impeded by our method.

3.2 Cryptographic chaining 

The basic idea is to cryptographically secure the payload contained in each
interval and include the generated security data in the subsequent interval to
form a cryptographic chain. We use the shorthand $(\cdot)_X = \mathrm{Priv}_X(h(\cdot))$ for entity
$X$'s digital signature by applying a private key $\mathrm{Priv}_X$ and a hash algorithm $h(\cdot)$.
TS is a time-stamping authority. The notation $\to$ signifies the sending of some
data. To sign a conversation A performs the following operations.

$\mathrm{Sec}_I$: $M_I = (D, \text{SIP\_Data}, \text{Auth\_Data}, \text{nonce}, \dots) \to B$;
       $S_0 = ((M_I)_A)_{TS} \to B$;
$\mathrm{Sec}_l$: $S_l = (I_l, S_{l-1})_A \to B$; $l = 1, \dots, 2N$
$\mathrm{Sec}_F$: $M_F = (\text{termination\_condition}, \dots) \to B$;
       $S_F = ((M_F, S_{2N})_A)_{TS} \to B$;

In the initial step $\mathrm{Sec}_I$, $(\cdot)_{TS}$ means a time-stamp applied by TS, e.g., according
to RFC 3161 [22], and is enveloping the meta-data $M_I$ signed by A (R2.1). This



6 Nicolai Kuntze, Andreas U. Schmidt, and Christian Hett 



may include some authentication data Auth_Data, e.g., A's digital certificates.
To provide a broad audit trail for later inspection, data from the call negotiation
and connection establishment, here subsumed under SIP_Data, should
be included. The final time-stamp can be used optionally to detect drift, and
narrows down the conversation in time. Since this is sufficient to secure the
temporal context required for cohesion, the application of time-stamps in every
step, which may be costly, is not proposed. A nonce is included in $M_I$ to prevent
replay attacks. By including $S_{l-1}$ in the signed data $S_l$ and $S_{2N}$ in $S_F$,
and by alternation of interval directions, R1.1 and R2.2 are satisfied. Signatures
of A and additional authentication data in $M_I$ support R2.3. If communication
breaks inadvertently, interval chaining is verifiable up to the last interval, thus
R2.4 is satisfied, with a loss of at most one interval duration of conversation
at its end. A controls interval timing and the operations $\mathrm{Sec}_I$, $\mathrm{Sec}_l$, and $\mathrm{Sec}_F$
occur at times $0$, $\lfloor l/2 \rfloor \cdot D$, and $N \cdot D$, respectively.
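
The chaining of security values can be made concrete with a minimal Python sketch. It rests on strong simplifying assumptions: an HMAC-SHA256 under a demo key stands in for A's signature $(\cdot)_A$ (it is not a digital signature and gives no non-repudiation), the time-stamps by TS are omitted, and the names `sign` and `chain` as well as the byte strings used as intervals and meta-data are hypothetical. What the sketch does show is the structural property used above: every security value covers its predecessor, so altering one interval changes all later values.

```python
import hashlib
import hmac

def sign(key: bytes, *parts: bytes) -> bytes:
    """Stand-in for (.)_A: HMAC-SHA256 over length-prefixed parts (NOT a real signature)."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        mac.update(len(part).to_bytes(4, "big"))  # length prefix keeps parts unambiguous
        mac.update(part)
    return mac.digest()

def chain(key: bytes, meta: bytes, intervals, final_meta: bytes):
    """Return [S_0, S_1, ..., S_2N, S_F] for the given interval payloads."""
    values = [sign(key, meta)]                         # S_0 (time-stamp by TS omitted)
    for payload in intervals:                          # S_l covers I_l and S_{l-1}
        values.append(sign(key, payload, values[-1]))
    values.append(sign(key, final_meta, values[-1]))   # S_F covers M_F and the last S_l
    return values

key = b"demo-key-of-A"  # assumption: placeholder for A's private key
good = chain(key, b"M_I", [b"I1", b"I2", b"I3", b"I4"], b"M_F")
bad = chain(key, b"M_I", [b"I1", b"IX", b"I3", b"I4"], b"M_F")
```

In this example the values for $S_0$ and $S_1$ agree in both chains, while the modified second interval makes $S_2$ and every subsequent value, including $S_F$, differ.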

3.3 Treatment of packet loss 

Digital voice communication offers a rather high reliability leading generally to 
a higher understandability of VoIP communication in comparison with all pre- 
decessors. However, packet loss may occur and must be treated as explained in 
R1.2. Denote by $\delta_l \subset \{1, \dots, K_l\}$ the sequence of identifiers of packets actually
received by A or B, respectively. Intervals are reduced accordingly to $I'_l = (p_{l,j})_{j \in \delta_l}$.
The steps $\mathrm{Sec}_l$ are modified by a protocol to report received packets.

$\mathrm{Sec}_{2k-1}$: repeat
            repeat
              interval_termination $\to B$;
            until $\delta_{2k-1} \to A$;
          until $S_{2k-1} = (I'_{2k-1}, S_{2k-2})_A \to B$;
$\mathrm{Sec}_{2k}$:   repeat
            $S_{2k} = (I'_{2k}, S_{2k-1})_A \to B$;
            $I'_{2k} \to B$;
          until success;

This accounts for losses in the VoIP (RTP) channel as well as failures in the
channel for transmission of signing data. The loop conditions can be evaluated
by explicit ($\mathrm{Sec}_{2k}$) or implicit ($\mathrm{Sec}_{2k-1}$) acknowledgements by receivers.
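
The reduction of an interval to the packets actually received can be sketched as follows. The function name `reduce_interval` and the use of 1-based list positions as packet identifiers are assumptions for illustration; in a real system the identifiers would be derived from the RTP packets themselves.

```python
def reduce_interval(sent, received_ids):
    """Return (reduced interval, lost identifiers) for one interval.

    sent         -- list of packets p_{l,1}, ..., p_{l,K_l}
    received_ids -- identifiers reported by the receiver (delta_l), 1-based
    """
    delta = sorted(set(received_ids))
    if any(j < 1 or j > len(sent) for j in delta):
        raise ValueError("receiver reported an unknown packet identifier")
    received = set(delta)
    reduced = [sent[j - 1] for j in delta]                        # I'_l
    lost = [j for j in range(1, len(sent) + 1) if j not in received]
    return reduced, lost

# Hypothetical interval of four packets, of which the third was lost.
reduced, lost = reduce_interval([b"a", b"b", b"c", b"d"], [1, 2, 4])
```

Only the reduced interval enters the signed data, while the list of lost identifiers remains available for application-level handling and later inspection, as required by R1.2.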

3.4 Extension to multilateral conversations 

Here we present the simplest way to extend the method above to conference-like
situations. Multilateral non-repudiation means mutual agreement about the
contents of a conversation between all parties as defined in the Appendix. For
implementing it for $M$ participants $A_0, \dots, A_{M-1}$ a round-robin scheme [24]
can be used to produce the required chain of signatures as in Lemma 1. Round-robin
is a simple algorithm to distribute the required security data between the
participants of the conference. Other base algorithms of distributed systems
like flooding, echo, or broadcast might be used, depending, for instance, on the
particular topology of the conference network. During the round, a token is
passed from participant to participant, signalling the signer role. If participant
$A_m$ carries the token, he waits for time $D$ and buffers packets sent by himself.
When $A_m$ terminates the interval a signalling and signing protocol is processed,
which, in contrast to the scheme above, only concerns data sent by $A_m$. The
numbering of intervals is as follows. In the time span from $0$ to $D$ the packets
$(p_{m;j})$ sent by $A_m$ are in the interval $I_m$. The packets emitted by $A_m$ during
$[D, 2D]$ are in $I_{m+M}$, and so on. It is here not feasible to sign merely the packets
received by everyone, because cumulative packet loss could be too high. Instead,
an additional hashing indirection is included and hashes $H_k = (h(p_{k;j}))_{j \in \theta_k}$ of all
packets $\theta_k$ received by at least one person from $A_m$ in interval $k$ are distributed
and can be used to check the signature in spite of packet loss. Let $\delta_k^{\sigma}$ denote the
list of packets sent by $A_m$ and received by $A_\sigma$ in interval $k$. Set $R_m = \{0, \dots, M-1\} \setminus \{m\}$
and let $r \ge 0$ be the round number. In order to account for latencies in
reporting of packet loss, computing hashes, and signing, we introduce a parallel
offset in the round-robin scheme. In round $r$ participant $A_m$ carrying the token
terminates interval with number $k(r, m) = rM^2 + (M+1)m + 1$. He secures the
set of intervals $\mathcal{I}(r, m) = (k(r, m) - M \cdot \{0, \dots, M-1\}) \cap \mathbb{N}$.

$\mathrm{Sec\_mult}_{r,m}$: $\forall \sigma \in R_m$ do
                repeat
                  interval_termination $\to A_\sigma$;
                until $(\delta_k^{\sigma})_{k \in \mathcal{I}(r,m)} \to A_m$;
              od;
              $\theta_k = \bigcup_{\sigma \in R_m} \delta_k^{\sigma}$ for $k \in \mathcal{I}(r,m)$;
              $S_{r,m} = (H_{r,m}, S_{\mathrm{pred}(r,m)})_{A_m}$;
              $\forall \sigma \in R_m$ do
                repeat
                  $S_{r,m} \to A_\sigma$;
                until success;
              od;

The preceding security value $S_{\mathrm{pred}(r,m)}$ bears indices

$\mathrm{pred}(r, m) = \begin{cases} (r, m-1) & \text{if } m \ge 1; \\ (r-1, M-1) & \text{if } r \ge 1,\ m = 0; \\ I & \text{otherwise,} \end{cases}$

where $I$ stands for the initialisation interval which can be constructed as in
the preceding sections, replacing single sending by broadcast with acknowledgements.
The numbering scheme for intervals and the evolving sequence of
security values is shown in Figure 1. In effect, $A_m$ broadcasts (with acknowledgement)
a signature over hashes of all packets received by at least one
other participant. This is the common security data with which the chain can
be continued. According to Lemma 1, non-repudiation of the total, multilateral
conversation for the first interval duration from time $0$ to $D$ is achieved after
execution of $\mathrm{Sec\_mult}_{2,0}$ at time $2M \cdot D$. With each further execution of
Sec_mult a subsequent piece of conversation of length $D$ obtains multilateral
non-repudiation.

Fig. 1. Numbering of intervals in the case of 4 participants along the time axis
(marked $D$ to $9D$). Arrows indicate the sequence of security values $S$. Thicker borders
separate rounds. Equally coloured intervals are secured in a single operation
$\mathrm{Sec\_mult}_{r,m}$.
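
The interval numbering of the round-robin scheme can be tabulated with a few lines of Python. The sketch assumes the formula $k(r, m) = rM^2 + (M+1)m + 1$ and the predecessor rule as read from the text, takes $\mathbb{N}$ to mean the positive integers, and uses $M = 4$ as in Figure 1; the function names `k`, `secured`, and `pred` are chosen for illustration.

```python
def k(r: int, m: int, M: int) -> int:
    """Number of the interval terminated by participant m in round r."""
    return r * M * M + (M + 1) * m + 1

def secured(r: int, m: int, M: int):
    """Intervals secured in Sec_mult_{r,m}: (k(r,m) - M*{0..M-1}) restricted to positive numbers."""
    return sorted(n for n in (k(r, m, M) - M * i for i in range(M)) if n >= 1)

def pred(r: int, m: int, M: int):
    """Indices of the preceding security value; "I" denotes the initialisation."""
    if m >= 1:
        return (r, m - 1)
    if r >= 1:
        return (r - 1, M - 1)
    return "I"

M = 4
terminated = [k(r, m, M) for r in range(2) for m in range(M)]
first_round = [secured(0, m, M) for m in range(M)]
```

For four participants the token holders terminate intervals 1, 6, 11, 16 in round 0 and 17, 22, 27, 32 in round 1. In round 0 the secured sets grow from {1} to {4, 8, 12, 16}, which reflects the parallel offset introduced to absorb reporting and signing latencies.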

In case of call termination, $2M + 1$ finalisation steps without audio data (two
final rounds plus finishing by the participant carrying the token at the time of
termination) are required to obtain non-repudiation of the last interval in time.
Joining and leaving a signed multilateral call while the signature is created by
the participants can be enabled through finalisation. If participant B requests
to join the call, $A_m$, who possesses the token, initiates a finalisation and B can
join after this (inserted as $m + 1$). In the case that a participant wishes to leave,
he awaits the token and finalises including a leave message.

3.5 Operational policies 

We do not lay out a complete set of rules for the operation of a system using 
the non-repudiation method above. Rather we list the most obvious ones and 
stress the most important point of monitoring and treatment of packet loss, or 
rather understandability. 

To account for requirement R1.3, users should be informed at all times
during a conversation about its signature status. This necessitates, to an
extent specified by application-specific policies, the cryptographic verification
of the interval chaining, and continual evaluation of relevant information, see
Section 4.1. Additionally a secure voice signing terminal should control every
aspect of user interaction and data transmission. This is elucidated in [25].

To maintain congruence and mitigate attacks aiming at mutilating a conver- 
sation, packet loss and the ensuing level of understandability must permanently 
be monitored. When the packet loss is above a configurable threshold, an action 
should be triggered according to predefined policies. The principal possibilities
are: 1. ignore; 2. notify users while continuing signing; 3. abort the signing; and 
4. terminate call. The first two options open the path for attacks. Termination 
of the call is the option for maximum security. From a practical viewpoint, the 
loss threshold is seldom reached without breakdown of the connection anyway 
due to insufficient understandability or timeouts.
Options 3 and 4 provide a 'Sollbruchstelle' (predetermined break point) for
the probative value of the conversation. In contrast, most other schemes for 
securing the integrity of streamed data, e.g., the signing method of [26] aim at 
loss-tolerance, for instance allowing for the verification of the stream signature 
with some probability in the presence of packet loss. We suggest that for the 
probative value of conversations, the former is advantageous. A signed call with 
an intermediate gap can give rise to speculations over alternatives to fill it, which 
are restricted by syntax and grammar, but can lead to different semantics. Using 
this, a clever and manipulative attacker could delete parts of the communication 
to claim with certain credibility that the remnants have another meaning than 
intended by the communication partner(s). If the contents of a conversation
after such an intentional deletion are unverifiable and thus cannot be used to 
prove anything, this kind of attack is effectively impeded. 
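
The threshold-based handling of packet loss can be sketched as a small policy check. The enumeration mirrors the four options listed above; the names `Action`, `loss_ratio`, and `check_interval`, the per-interval granularity of the check, and the example threshold of 10% are assumptions for illustration and not prescribed by the method.

```python
from enum import Enum

class Action(Enum):
    IGNORE = 1
    NOTIFY = 2
    ABORT_SIGNING = 3
    TERMINATE_CALL = 4

def loss_ratio(sent: int, received: int) -> float:
    """Fraction of the packets of an interval that were not reported as received."""
    return 0.0 if sent == 0 else (sent - received) / sent

def check_interval(sent: int, received: int, threshold: float, policy: Action):
    """Return the configured action if the loss exceeds the threshold, else None."""
    return policy if loss_ratio(sent, received) > threshold else None

# Hypothetical values: 50 packets per interval, 10% tolerated loss.
ok = check_interval(50, 48, threshold=0.10, policy=Action.TERMINATE_CALL)
bad = check_interval(50, 40, threshold=0.10, policy=Action.TERMINATE_CALL)
```

Here the first interval (4% loss) passes, while the second (20% loss) triggers the configured action. Choosing `ABORT_SIGNING` or `TERMINATE_CALL` as the policy corresponds to the predetermined break point discussed above.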

4 Security considerations 

We corroborate the statement that interval chaining can achieve non-repudiation
for VoIP conversations, based on the information generally secured by interval 
chaining. An analysis based on an instance of a system architecture (the VoIP 
archive presented in Section 5 below) and possible attacks is contained in [27]. 

4.1 Auditable information 

In this section we analyse the information that can be gained and proved to
have integrity in a call secured by interval chaining. Table 1 gives a, perhaps
incomplete, overview of this audit data, which may be amenable to forensic
inspection, e.g., by an expert witness in court, or, at the other extreme,
applicable during the ongoing conversation, or both.

4.2 Comparison with SRTP and IPsec 

The well-known security methods SRTP and IPsec address the protection of
confidentiality, authenticity and data integrity on the application and
network layer, respectively, and can be applied to VoIP, also in parallel with interval
chaining. We want to show salient features of interval chaining, which distinguish
it from both standards and in our view provide a higher level of non-repudiation
and even practicality. On the fundamental level, both SRTP and
IPsec necessarily operate on the packet level and do not by themselves pro- 
vide protection of the temporal sequence and cohesion of a VoIP conversation. 
While it is true that pertinent information can be reconstructed from the RTP 
sequence numbers, in turn protected by hash values, such an approach would 
have some weaknesses, which taken together do not allow full non-repudiation. 
In particular, RTP sequence numbers can suffer from roll-overs and, though their
integrity is secured in transmission, they can still rather easily be forged by
the sender, since they belong to protocol stacks which are not especially secured
in common systems. While packet loss can be detected or reconstructed
using sequence numbers, interval chaining yields a well-defined, tunable, and
cryptographically secured means to deal with it during an ongoing conversation,
significantly limiting potential attack vectors. In essence, RTP sequence numbers
are not designed to ensure a conversation's integrity and thus have lower
evidentiary value in comparison to chained intervals. From the viewpoint of
electronic signatures, their level of message, respectively conversation, authentication
can only be achieved with a protocol-independent means to manage
authentication data such as asymmetric keys, i.e., a Public Key Infrastructure.
The connection and session dependent key handling of IPsec and SRTP, relying
on HMACs and merely allowing for symmetric keys deprived of authentication
semantics, is generally insufficient for non-repudiation. Interval chaining is an
independent means to control the cryptographic workload, benefiting scalability.
Finally, NAT traversal is a problem for network layer integrity protection like
IPsec since rewriting IP headers invalidates corresponding hash values (a solution
has been proposed by TISPAN [28]). This problem does not occur with the
interval chaining method, since only RTP headers, not IP headers of packets
need to be (and are in the implementation below) signed.

Table 1. Auditable information of a conversation secured with interval chaining.
Columns: secured data item audited, non-repudiation requirement addressed,
protection target supported, actual information indicated or verified, and when
the check is applicable.

5 Application to a secure VoIP archive 

In this section we present an efficient self-signed archive for VoIP calls and its
system architecture. It was implemented as a prototype together with a verification
and playback tool, requires no modification to the terminal equipment,
and secures the ongoing conversations 'on the fly'. Section 5.1 was partially
published in [27]. The archive uses timestamps to secure the exact starting time of a
conversation and not the moment of archiving. The main design principle is that
of minimal technical requirements on the communication clients. Figure 2(a) 
shows the communication between two partners A and B over a VoIP channel. 
The security component VSec, the component implementing the interval chain- 
ing, can listen to the communication at any point in the channel. Neither the 
exact position in the channel nor the technical method by which VSec intercepts 
it is essential for the architecture and its security properties. It can be located 
at the site of either of the parties A or B, e.g., in the case of call-centre applications.
The channel is not required to be digital, let alone SIP/RTP based,
end-to-end, provided that there is some part of the channel which is VoIP. This 
condition is already met in many mobile and public switched networks. Accord- 
ingly, the phones used by A and B need not be ISDN or VoIP phones. VSec will 
often be under the control of one of the parties or even be integrated in their 
VoIP infrastructure.

Fig. 2. (a) High-level modular architecture of a secure voice archive. (b) The verification
and playback tool performing checks on the archived call.

The role of VSec can be passive or dual, for instance to enforce policies, 
cf. Section 3.5. The separation of such a component is standard in security 
engineering where it is commonly known as a reference monitor [29]. The com- 
ponent Arc denotes the archive to which the secured VoIP communication is 
submitted and then persistently stored, with due consideration of long-term security.
T1 and T2 are time-stamping authorities which come into play to raise
resilience against attacks, e.g., if VSec, Arc or both are compromised. 

5.1 Implementation 

The archive system has been implemented as a prototype and tested with several
soft phones and devices, e.g., AVM's Fritz!Box [30] using SIP/RTP. For
B we used mobile phones, ISDN phones, and also SIP software clients. VSec
was implemented using C#, running on an embedded x86 based PC without
keyboard, mouse or video ports on Linux with the Mono framework. It was
placed as an outbound proxy, substituting A's original one, between A and the
Internet using its two NICs, thus supporting multiple concurrent clients and
calls. The proxy modifies RTP ports and IP addresses contained in the SIP
packets, redirecting them to itself, and forwards them to the original recipients.
A traditional PC was used for Arc, connected over a reliable TCP channel (TLS
can be used for privacy) using the third NIC of VSec.

Intervals are formed as in Sections 3.2 and 3.3, though packet loss is here
handled by simply storing a list of the received packets. This is an obvious
necessity as the terminal equipment cannot be modified in a pure archiving
scenario. VSec carries an X.509 certificate together with the private (RSA) key
to sign intervals, including $S_0$ and $S_f$, which contain metadata. The signing
operation $(\cdot)_A$ is extended beyond the presented formal scheme by not
using, e.g., RSA signatures directly, but instead PKCS#7 signed envelopes (where
data is stored together with its signature) and PKCS#7 detached signatures
(where signatures are stored alone). This has the advantage that certificates
and certificate chains can be stored and transported together with the data.
For storage efficiency, only the first interval's PKCS#7 signed envelope
container contains the whole certificate chain up to, but not including, the
root, while all other containers do not need to carry this redundant
information. $S_0$ is also wrapped in a signature from the time stamp service T1.
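
The chaining of signed intervals together with the list of received packets
can be sketched as follows. This is an illustration under stated assumptions,
not the prototype's format: the record layout and function names are
hypothetical, each interval is assumed to embed the signature of its
predecessor, and a keyed HMAC stands in for the PKCS#7/RSA signature so that
the example runs with the Python standard library alone. An HMAC does not
provide non-repudiation; it only makes the chaining logic visible.

```python
import hashlib
import hmac
import json


def sign(key: bytes, data: bytes) -> bytes:
    # Stand-in (assumption) for a PKCS#7 signature with VSec's RSA key.
    return hmac.new(key, data, hashlib.sha256).digest()


def make_interval(key: bytes, prev_sig: bytes, index: int, packets):
    """Build and sign one interval record.

    packets: list of (rtp_sequence_number, payload) actually received.
    Lost packets simply do not appear in the 'received' list.
    """
    record = {
        "index": index,
        "prev": prev_sig.hex(),  # links this interval to its predecessor
        "received": [seq for seq, _ in packets],
        "digest": hashlib.sha256(b"".join(p for _, p in packets)).hexdigest(),
    }
    body = json.dumps(record, sort_keys=True).encode()
    return body, sign(key, body)


def verify_chain(key: bytes, start_sig: bytes, intervals) -> bool:
    """Check every signature and that the intervals form an unbroken chain."""
    prev = start_sig
    for position, (body, sig) in enumerate(intervals, start=1):
        if not hmac.compare_digest(sig, sign(key, body)):
            return False
        record = json.loads(body)
        if record["prev"] != prev.hex() or record["index"] != position:
            return False
        prev = sig
    return True
```

Here `start_sig` plays the role of the signature over the initial metadata
interval. Because each record contains the previous signature, removing,
reordering or altering an interval invalidates the chain from that point on.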

The certificate and private key of VSec are not only used to sign the
conversation, but also to authenticate VSec toward Arc. Each completed interval
is immediately transmitted to Arc, which then performs several tests, see
Table 1, including signature verification. It then appends the interval as a
chunk to an open file. This yields resilience even against severe attacks, e.g.,
a compromised VSec. Arc checks the trusted third party time stamp from T1
immediately at the start of the call, so that the live archiving cannot be
delayed, and also checks each interval as soon as it arrives. Therefore, a
compromised VSec would have to forge conversations within the duration of an
interval, i.e., in real time, which is infeasible. On the other hand, a
compromised Arc does not possess the private key belonging to the certificate.
Even if both collude, they would still be bound inside the two timestamps from
T1 and T2, which could be compared, e.g., to an itemised bill from the phone
company.
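
The real-time argument can be made concrete with a small Python sketch of an
arrival-time test on the archive side. It only illustrates the timing
constraint; the tests actually performed by Arc are those of Table 1. The
function name, the tolerance `slack_s` and the convention that interval k ends
k interval lengths after the T1 time stamp are assumptions.

```python
def arrival_ok(t1_time: float, index: int, arrival: float,
               interval_s: float = 1.0, slack_s: float = 0.5) -> bool:
    """Accept interval `index` (1-based) only inside its arrival window.

    t1_time : time certified by T1 at the start of the call (seconds)
    arrival : time at which Arc received the signed interval (seconds)
    slack_s : hypothetical tolerance for signing and transmission delay
    """
    # The interval cannot be complete before its own end ...
    earliest = t1_time + index * interval_s
    # ... and a late arrival would leave time for manipulation.
    latest = earliest + slack_s
    return earliest <= arrival <= latest
```

With one-second intervals and this tolerance, a forger controlling VSec would
have at most `slack_s` seconds beyond the interval itself to produce a
consistent, signed replacement for each interval.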

The implemented system performed very well and was able to archive multiple
parallel calls on the mentioned hardware without notable load or memory
pressure. The duration of an interval is one of the main configuration
parameters to be tuned. One second proved sufficient to provide a high level of
security for the context of the talk, while keeping the required computational
power well within the capacity of the processor used and the ratio of storage
overhead (400 bytes for PKCS#7 signatures without embedded certificates) to
payload small.
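
The order of magnitude of the storage overhead can be estimated with a short
calculation. The 400 bytes per signature and the one-second interval are taken
from the text; the codec bit rate (64 kbit/s, as for G.711), the restriction to
codec payload without RTP/UDP/IP headers, and the assumption that one signature
per interval covers both directions are ours.

```python
def signature_overhead(sig_bytes: int = 400,
                       interval_s: float = 1.0,
                       codec_bitrate_bps: int = 64_000,
                       directions: int = 2) -> float:
    """Ratio of signature bytes to voice payload bytes per interval."""
    payload_bytes = codec_bitrate_bps / 8 * interval_s * directions
    return sig_bytes / payload_bytes
```

Under these assumptions, a one-second interval carries 16000 bytes of payload,
so the overhead is 400 / 16000 = 2.5 %. Shortening the interval to 0.1 s would
raise it to 25 %, which illustrates why the interval duration has to be tuned.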

6 Conclusions 

IP-based multimedia communication is not restricted to VoIP; for instance,
several video conferencing systems are by now maturing, some of which are based
on sophisticated peer-to-peer communication [31, 32]. Moreover, the service
quality and availability of the new communication channels are constantly
increasing through developments like packet loss concealment (PLC) for audio
[33] and even video [34, 35] streams. Our proposed method for non-repudiation
is applicable in all these contexts. Its adoption would pave the way for a new
paradigm for trustworthy, inter-personal communication.

The next steps in our research are to i) implement the operational con- 
text for electronic signatures over speech, i.e., user interaction and signalling, 
ii) devise a trustworthy signature terminal for that purpose, preferably using 
Trusted Computing technology on mobile devices, e.g., to secure audio I/O and 
processing, iii) extend the method to conferences and other media than VoIP. 

Appendix: Multilateral non-repudiation with signatures 

Contracts embodied in paper or electronic form often bear the attribute of 
mutual non-repudiation (in the case of two parties) if both A and B receive a 
copy of the doubly signed document, and know with certainty that the other 
party has it as well. To achieve mutuality in digital contexts, some kind of 
bi-lateral electronic signing must be performed. For electronic documents, it is 
well known that a simple, three-step protocol (called BAKO [36]), enveloping 
three signatures in the sequence ABA around the document, suffices. 

This section superficially formalises the notion of a multilateral agreement
between parties $A_1, \ldots, A_m$ on the fact Comm, which is the logical
assertion that a certain communication has taken place between them. A more
rigorous treatment in terms of formal languages can be found in [37]. Assume
that Comm can be acknowledged with certainty by $A_i$, i.e., $A_i$ asserts his
knowledge that no sequence of events can occur which negates Comm, by digitally
signing it, i.e., by forming $(\mathit{Comm})_{A_i}$. We write



Multilateral non-repudiation is partial common knowledge of Comm in the sense
that every party knows that every party knows Comm, i.e.,



In extension of BAKO for mutuality we find 

Lemma 1. Multilateral non-repudiation of Comm is achieved for $A_1, \ldots, A_m$
by formation of

$$(\ldots(((\ldots((\mathit{Comm})_{A_1})_{A_2}\cdots)_{A_m})_{A_1})_{A_2}\cdots)_{A_{m-1}}.$$

Proof. Replacing B in the BAKO scheme ABA by $A_2 \ldots A_m$ we see that
formation of $((\ldots((\mathit{Comm})_{A_1})_{A_2}\cdots)_{A_m})_{A_1}$ establishes






The statement follows by iteration. 
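
The nesting of signatures in Lemma 1 can be explored with a short Python
sketch. It assumes that the second round of signatures ends with $A_{m-1}$, so
that $m = 2$ reduces to the BAKO sequence ABA, and it uses a simplified reading
of knowledge: a party is taken to know that another party knows Comm once it
has signed over that party's signature. All function names are hypothetical.

```python
def signing_order(m: int) -> list:
    """Indices of the signers, innermost first: A_1..A_m, then A_1..A_{m-1}."""
    if m < 2:
        raise ValueError("at least two parties are required")
    return list(range(1, m + 1)) + list(range(1, m))


def nest(order, fact: str = "Comm") -> str:
    """Render the nested signature expression as plain text."""
    expr = fact
    for i in order:
        expr = f"({expr})_A{i}"
    return expr


def witnessed(order) -> dict:
    """For each party, the parties whose signatures lie inside its last
    signature, plus the party itself."""
    seen = {}
    for position, i in enumerate(order):
        seen[i] = set(order[:position]) | {i}
    return seen


def is_multilateral(order, m: int) -> bool:
    """True if every party has signed over a signature of every other party."""
    everyone = set(range(1, m + 1))
    seen = witnessed(order)
    return all(seen.get(i) == everyone for i in everyone)
```

For two parties, `signing_order(2)` yields the sequence 1, 2, 1 and `nest`
renders it as `(((Comm)_A1)_A2)_A1`. A single round such as 1, 2 fails the
check, since the first signer has not yet seen the signature of the second;
appending a further signature of $A_m$ would not invalidate it.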